Abstract

We study distributional similarity measures for 
the purpose of improving probability estima- 
tion for unseen cooccurrences. Our contribu- 
tions are three-fold: an empirical comparison 
of a broad range of measures; a classification 
of similarity functions based on the information 
that they incorporate; and the introduction of
a novel function that is superior at evaluating
potential proxy distributions.

1 Introduction 

An inherent problem for statistical methods in
natural language processing is that of sparse
data: the inaccurate representation in any
training corpus of the probability of low-frequency
events. In particular, reasonable events
that happen to not occur in the training set may
mistakenly be assigned a probability of zero.
These unseen events generally make up a substantial
portion of novel data; for example, Essen
and Steinbiss (1992) report that 12% of the
test-set bigrams in a 75%-25% split of one million
words did not occur in the training partition.

We consider here the question of how to estimate
the conditional cooccurrence probability
P(v|n) of an unseen word pair (n, v) drawn from
some finite set N × V. Two state-of-the-art
technologies are Katz's (1987) backoff method
and Jelinek and Mercer's (1980) interpolation
method. Both use P(v) to estimate P(v|n)
when (n, v) is unseen, essentially ignoring the
identity of n.

An alternative approach is distance-weighted
averaging, which arrives at an estimate for unseen
cooccurrences by combining estimates for
cooccurrences involving similar words:

$$\hat{P}(v|n) = \frac{\sum_{m \in S(n)} \mathrm{sim}(n,m)\, P(v|m)}{\sum_{m \in S(n)} \mathrm{sim}(n,m)} \qquad (1)$$



where S(n) is a set of candidate similar words
and sim(n, m) is a function of the similarity
between n and m. We focus on distributional
rather than semantic similarity (e.g., Resnik
(1995)) because the goal of distance-weighted
averaging is to smooth probability distributions:
although the words "chance" and "probability"
are synonyms, the former may not be a
good model for predicting what cooccurrences
the latter is likely to participate in.
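As an illustration of the distance-weighted averaging estimate described above, the following is a minimal Python sketch. The function name, the dictionary layout for the base probabilities, and the toy nouns and weights are assumptions made for the example; they do not come from the experiments reported here.

```python
from typing import Callable, Dict, Iterable

Dist = Dict[str, float]  # maps each verb v to P(v|m) for one noun m


def distance_weighted_estimate(
    v: str,
    n: str,
    neighbors: Iterable[str],
    sim: Callable[[str, str], float],
    cond_prob: Dict[str, Dist],
) -> float:
    """Similarity-weighted average of P(v|m) over the candidate set S(n)."""
    numerator = 0.0
    denominator = 0.0
    for m in neighbors:
        weight = sim(n, m)
        numerator += weight * cond_prob[m].get(v, 0.0)
        denominator += weight
    if denominator == 0.0:
        raise ValueError("all similarity weights are zero")
    return numerator / denominator


# Toy data (hypothetical): two candidate neighbors of "apple" with made-up weights.
cond_prob = {"pear": {"eat": 0.6, "peel": 0.4}, "car": {"drive": 1.0}}
weights = {("apple", "pear"): 0.9, ("apple", "car"): 0.1}
estimate = distance_weighted_estimate(
    "eat", "apple", ["pear", "car"], lambda n, m: weights[(n, m)], cond_prob
)
```

In this toy case the estimate is dominated by "pear", the neighbor with the larger weight, which is the intended effect of weighting by similarity.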

There are many plausible measures of distributional
similarity. In previous work (Dagan
et al., 1999), we compared the performance of
three different functions: the Jensen-Shannon
divergence (total divergence to the average), the
L1 norm, and the confusion probability. Our
experiments on a frequency-controlled pseudoword
disambiguation task showed that using
any of the three in a distance-weighted averaging
scheme yielded large improvements over
Katz's backoff smoothing method in predicting
unseen cooccurrences. Furthermore, by using a
restricted version of model (1) that stripped
incomparable parameters, we were able to empirically
demonstrate that the confusion probability
is fundamentally worse at selecting useful
similar words. D. Lin also found that the choice
of similarity function can affect the quality of
automatically-constructed thesauri to a statistically
significant degree (1998a) and the ability
to determine common morphological roots by as
much as 49% in precision (1998b).



^1 The term "similarity-based", which we have used
previously, has been applied to describe other models
as well (L. Lee, 1997; Karov and Edelman, 1998).



These empirical results indicate that investigating
different similarity measures can lead to
improved natural language processing. On the
other hand, while there have been many similarity
measures proposed and analyzed in the
information retrieval literature (Jones and Furnas,
1987), there has been some doubt expressed
in that community that the choice of similarity
metric has any practical impact:

Several authors have pointed out that
the difference in retrieval performance
achieved by different measures of association
is insignificant, providing that
these are appropriately normalised.
(van Rijsbergen, 1979, pg. 38)

But no contradiction arises because, as van Rijsbergen
continues, "one would expect this since
most measures incorporate the same information".
In the language-modeling domain, there
is currently no agreed-upon best similarity metric
because there is no agreement on what the
"same information" (the key data that a similarity
function should incorporate) is.

The overall goal of the work described here 
was to discover these key characteristics. To 
this end, we first compared a number of com- 
mon similarity measures, evaluating them in a 
parameter-free way on a decision task. When 
grouped by average performance, they fell into 
several coherent classes, which corresponded to 
the extent to which the functions focused on 
the intersection of the supports (regions of posi- 
tive probability) of the distributions. Using this 
insight, we developed an information-theoretic 
metric, the skew divergence, which incorporates 
the support-intersection data in an asymmetric 
fashion. This function yielded the best perfor- 
mance overall: an average error rate reduction 
of 4% (significant at the .01 level) with respect 
to the Jensen-Shannon divergence, the best predictor
of unseen events in our earlier experiments
(Dagan et al., 1999).

Our contributions are thus three-fold: an em- 
pirical comparison of a broad range of similarity 
metrics using an evaluation methodology that 
factors out inessential degrees of freedom; a pro- 
posal, building on this comparison, of a charac- 
teristic for classifying similarity functions; and 
the introduction of a new similarity metric in- 
corporating this characteristic that is superior 
at evaluating potential proxy distributions. 



2 Distributional Similarity Functions 

In this section, we describe the seven distributional
similarity functions we initially evaluated.^2
For concreteness, we choose N and V
to be the set of nouns and the set of transitive 
verbs, respectively; a cooccurrence pair (n, v) 
results when n appears as the head noun of the 
direct object of v. We use P to denote probabil- 
ities assigned by a base language model (in our 
experiments, we simply used unsmoothed rel- 
ative frequencies derived from training corpus 
counts). 

Let n and m be two nouns whose distributional
similarity is to be determined; for notational
simplicity, we write q(v) for P(v|n) and
r(v) for P(v|m), their respective conditional
verb cooccurrence probabilities.

Figure 1 lists several familiar functions. The
cosine metric and Jaccard's coefficient are commonly
used in information retrieval as measures
of association (Salton and McGill, 1983). Note
that Jaccard's coefficient differs from all the
other measures we consider in that it is essentially
combinatorial, being based only on the
sizes of the supports of q, r, and q · r rather
than the actual values of the distributions.

Previously, we found the Jensen-Shannon divergence
(Rao, 1982; J. Lin, 1991) to be a useful
measure of the distance between distributions:

$$JS(q,r) = \frac{1}{2}\left[ D\left(q \,\|\, \mathrm{avg}_{q,r}\right) + D\left(r \,\|\, \mathrm{avg}_{q,r}\right) \right]$$



The function D is the KL divergence, which
measures the (always nonnegative) average inefficiency
in using one distribution to code for
another (Cover and Thomas, 1991):

$$D(p_1(V) \,\|\, p_2(V)) = \sum_{v} p_1(v) \log \frac{p_1(v)}{p_2(v)}$$



The function avg_{q,r} denotes the average distribution
avg_{q,r}(v) = (q(v) + r(v))/2; observe that
its use ensures that the Jensen-Shannon divergence
is always defined. In contrast, D(q||r) is
undefined if q is not absolutely continuous with
respect to r (i.e., the support of q is not a subset
of the support of r).
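The contrast between the two divergences can be made concrete with a short Python sketch. Representing distributions as dictionaries and returning math.inf where the KL divergence is undefined are conventions chosen for this example only; the toy distributions are likewise hypothetical.

```python
import math
from typing import Dict

Dist = Dict[str, float]  # verb -> conditional probability


def kl_divergence(p1: Dist, p2: Dist) -> float:
    """D(p1 || p2); math.inf stands in for the undefined case (an assumption)."""
    total = 0.0
    for v, p in p1.items():
        if p == 0.0:
            continue  # a zero-probability verb contributes nothing
        p_other = p2.get(v, 0.0)
        if p_other == 0.0:
            return math.inf  # support of p1 is not inside the support of p2
        total += p * math.log(p / p_other)
    return total


def js_divergence(q: Dist, r: Dist) -> float:
    """Average of the KL divergences from q and from r to their average."""
    avg = {v: (q.get(v, 0.0) + r.get(v, 0.0)) / 2 for v in set(q) | set(r)}
    return 0.5 * (kl_divergence(q, avg) + kl_divergence(r, avg))


# Toy distributions whose supports only partly overlap.
q = {"eat": 0.5, "peel": 0.5}
r = {"eat": 0.5, "drive": 0.5}
kl_qr = kl_divergence(q, r)   # undefined case: "peel" has r(v) = 0
js_qr = js_divergence(q, r)   # finite, because avg is positive wherever q or r is
```

Because the average distribution is positive on the union of the two supports, neither KL term inside the Jensen-Shannon divergence can hit the undefined case.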

^2 Strictly speaking, some of these functions are dissimilarity
measures, but each such function f can be recast
as a similarity function via the simple transformation
C − f, where C is an appropriate constant. Whether we
mean f or C − f should be clear from context.



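Euclidean distance      $L_2(q,r) = \sqrt{\sum_{v} (q(v) - r(v))^2}$
L1 norm                 $L_1(q,r) = \sum_{v} |q(v) - r(v)|$
cosine                  $\cos(q,r) = \frac{\sum_{v} q(v)\, r(v)}{\sqrt{\sum_{v} q(v)^2 \sum_{v} r(v)^2}}$
Jaccard's coefficient   $Jac(q,r) = \frac{|\{v : q(v) > 0 \text{ and } r(v) > 0\}|}{|\{v \mid q(v) > 0 \text{ or } r(v) > 0\}|}$

Figure 1: Well-known functions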
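The four functions of Figure 1 can be sketched in a few lines of Python. The dictionary representation and the function names are assumptions of this example; the formulas are the standard definitions of these measures.

```python
import math
from typing import Dict

Dist = Dict[str, float]  # verb -> conditional probability


def euclidean(q: Dist, r: Dist) -> float:
    verbs = set(q) | set(r)
    return math.sqrt(sum((q.get(v, 0.0) - r.get(v, 0.0)) ** 2 for v in verbs))


def l1_norm(q: Dist, r: Dist) -> float:
    verbs = set(q) | set(r)
    return sum(abs(q.get(v, 0.0) - r.get(v, 0.0)) for v in verbs)


def cosine(q: Dist, r: Dist) -> float:
    dot = sum(p * r.get(v, 0.0) for v, p in q.items())
    norm = math.sqrt(sum(p * p for p in q.values()) * sum(p * p for p in r.values()))
    return dot / norm


def jaccard(q: Dist, r: Dist) -> float:
    # Purely combinatorial: only the supports matter, not the probability values.
    v_q = {v for v, p in q.items() if p > 0}
    v_r = {v for v, p in r.items() if p > 0}
    return len(v_q & v_r) / len(v_q | v_r)


# Toy distributions (hypothetical) sharing one verb out of three.
q = {"eat": 0.5, "peel": 0.5}
r = {"eat": 0.5, "drive": 0.5}
```

Note that the first two are dissimilarities (larger means less alike) while the last two are similarities, which is the distinction footnote 2 addresses.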



The confusion probability has been used by
several authors to smooth word cooccurrence
probabilities (Sugawara et al., 1985; Essen and
Steinbiss, 1992; Grishman and Sterling, 1993);
it measures the degree to which word m can
be substituted into the contexts in which n appears.
If the base language model probabilities
obey certain Bayesian consistency conditions
(Dagan et al., 1999), as is the case for
relative frequencies, then we may write the confusion
probability as follows:



$$\mathrm{conf}(q, r, P(m)) = \sum_{v} q(v)\, r(v)\, \frac{P(m)}{P(v)}$$



Note that it incorporates unigram probabilities 
as well as the two distributions q and r. 
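A minimal Python sketch of the confusion probability follows, under the assumption that the unigram probabilities P(m) and P(v) are supplied by the caller. All names and numbers below are hypothetical illustration values.

```python
from typing import Dict

Dist = Dict[str, float]  # verb -> probability


def confusion_probability(q: Dist, r: Dist, p_m: float, p_v: Dist) -> float:
    """Sum over verbs of q(v) * r(v) * P(m) / P(v).

    Only verbs with q(v) > 0 and r(v) > 0 contribute, so the loop skips the rest.
    """
    total = 0.0
    for v, q_v in q.items():
        r_v = r.get(v, 0.0)
        if q_v > 0.0 and r_v > 0.0:
            total += q_v * r_v / p_v[v]
    return p_m * total


# Toy values (hypothetical): one shared verb, made-up unigram probabilities.
q = {"eat": 0.5, "peel": 0.5}
r = {"eat": 0.5, "drive": 0.5}
p_v = {"eat": 0.25, "peel": 0.05, "drive": 0.10}
conf_value = confusion_probability(q, r, p_m=0.2, p_v=p_v)
```

Unlike the functions in Figure 1, the result changes if P(m) or P(v) changes even when q and r stay fixed.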



Finally, Kendall's τ, which appears in work
on clustering similar adjectives (Hatzivassiloglou
and McKeown, 1993; Hatzivassiloglou,
1996), is a nonparametric measure of the association
between random variables (Gibbons,
1993). In our context, it looks for correlation
between the behavior of q and r on pairs of
verbs. Three versions exist; we use the simplest
here:

$$\tau(q,r) = \sum_{v_1, v_2} \frac{\mathrm{sign}\left[(q(v_1) - q(v_2))(r(v_1) - r(v_2))\right]}{2\binom{|V|}{2}}$$

where sign(x) is 1 for positive arguments, −1
for negative arguments, and 0 at 0. The intuition
behind Kendall's τ is as follows. Assume
all verbs have distinct conditional probabilities.
If sorting the verbs by the likelihoods assigned
by q yields exactly the same ordering as that
which results from ranking them according to
r, then τ(q,r) = 1; if it yields exactly the opposite
ordering, then τ(q,r) = −1. We treat a
value of −1 as indicating extreme dissimilarity.^3

It is worth noting at this point that there
are several well-known measures from the NLP
literature that we have omitted from our experiments.
Arguably the most widely used is
the mutual information (Hindle, 1990; Church
and Hanks, 1990; Dagan et al., 1995; Luk,
1995; D. Lin, 1998a). It does not apply in
the present setting because it does not measure
the similarity between two arbitrary probability
distributions (in our case, P(V|n) and
P(V|m)), but rather the similarity between
a joint distribution P(X1, X2) and the corresponding
product distribution P(X1)P(X2).
Hamming-type metrics (Cardie, 1993; Zavrel
and Daelemans, 1997) are intended for data
with symbolic features, since they count feature
label mismatches, whereas we are dealing
with feature values that are probabilities. Variations
of the value difference metric (Stanfill and
Waltz, 1986) have been employed for supervised
disambiguation (Ng and H.B. Lee, 1996; Ng,
1997); but it is not reasonable in language modeling
to expect training data tagged with correct
probabilities. The Dice coefficient (Kay and
Röscheisen, 1993; Smadja et al., 1996; D. Lin,
1998a, 1998b) is monotonic in Jaccard's coefficient
(van Rijsbergen, 1979), so its inclusion
in our experiments would be redundant. Finally,
we did not use the KL divergence because
it requires a smoothed base language model.
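For readers who want to experiment with Kendall's τ as used in this section, here is a small Python sketch. It assumes the simplest version of the statistic (a sum of signs over ordered pairs of distinct verbs, divided by |V|(|V| − 1), i.e. twice the number of unordered pairs); the vocabulary argument and the toy distributions are hypothetical.

```python
from itertools import permutations
from typing import Dict, Iterable

Dist = Dict[str, float]  # verb -> conditional probability


def sign(x: float) -> int:
    """1 for positive arguments, -1 for negative arguments, 0 at 0."""
    return (x > 0) - (x < 0)


def kendall_tau(q: Dist, r: Dist, vocabulary: Iterable[str]) -> float:
    verbs = list(vocabulary)
    size = len(verbs)
    if size < 2:
        raise ValueError("need at least two verbs")
    total = 0
    # Ordered pairs of distinct verbs; a pair (v, v) would contribute sign(0) = 0.
    for v1, v2 in permutations(verbs, 2):
        dq = q.get(v1, 0.0) - q.get(v2, 0.0)
        dr = r.get(v1, 0.0) - r.get(v2, 0.0)
        total += sign(dq * dr)
    return total / (size * (size - 1))


# Toy example: identical and reversed rankings over a three-verb vocabulary.
vocab = ["eat", "peel", "buy"]
q = {"eat": 0.5, "peel": 0.3, "buy": 0.2}
same_order = {"eat": 0.6, "peel": 0.3, "buy": 0.1}
reverse_order = {"eat": 0.1, "peel": 0.3, "buy": 0.6}
```

The loop is quadratic in the vocabulary size and visits verbs that occur with neither noun, which reflects the fact that the statistic is defined over all of V.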



^3 Zero would also be a reasonable choice, since it indicates
zero correlation between q and r. However, it
would then not be clear how to average in the estimates
of negatively correlated words in equation (1).



3 Empirical Comparison 

We evaluated the similarity functions introduced
in the previous section on a binary decision
task, using the same experimental framework
as in our previous preliminary comparison
(Dagan et al., 1999). That is, the data
consisted of the verb-object cooccurrence pairs
in the 1988 Associated Press newswire involving
the 1000 most frequent nouns, extracted
via Church's (1988) and Yarowsky's processing
tools. 587,833 (80%) of the pairs served
as a training set from which to calculate base
probabilities. From the other 20%, we prepared
test sets as follows: after discarding pairs
occurring in the training data (after all, the
point of similarity-based estimation is to deal
with unseen pairs), we split the remaining pairs
into five partitions, and replaced each noun-verb
pair (n, v1) with a noun-verb-verb triple
(n, v1, v2) such that P(v2) ≈ P(v1). The task
for the language model under evaluation was
to reconstruct which of (n, v1) and (n, v2) was
the original cooccurrence. Note that by construction,
(n, v1) was always the correct answer,
and furthermore, methods relying solely on unigram
frequencies would perform no better than
chance. Test-set performance was measured by
the error rate, defined as

$$\frac{1}{T}\left(\#\text{ of incorrect choices} + (\#\text{ of ties})/2\right),$$

where T is the number of test triple tokens in
the set, and a tie results when both alternatives
are deemed equally likely by the language model
in question.

To perform the evaluation, we incorporated
each similarity function into a simple decision
rule as follows. As above, let (n, v1, v2) be a
test instance. For a given similarity measure
f and neighborhood size k, let S_{f,k}(n) denote
the k most similar words to n according to f.
We define the evidence E_{f,k}(n, v1) for v1 as the
number of neighbors m ∈ S_{f,k}(n) such that
P(v1|m) > P(v2|m); similarly, the evidence for
v2 is the number of the k closest neighbors that
favor v2 over v1. Then, the decision rule is to
choose the verb alternative with the greatest evidence.
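The decision rule and the error rate can be sketched as follows in Python. This is only an illustration under stated assumptions: the similarity argument is taken to return larger values for more similar words (a dissimilarity would be negated first), and the function names, outcome labels, and toy data are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Dist = Dict[str, float]  # verb -> P(verb | noun)


def k_most_similar(n: str, candidates: Sequence[str],
                   similarity: Callable[[str, str], float], k: int) -> List[str]:
    """The k candidates ranked most similar to n."""
    return sorted(candidates, key=lambda m: similarity(n, m), reverse=True)[:k]


def evidence(neighbors: Sequence[str], v_a: str, v_b: str,
             cond_prob: Dict[str, Dist]) -> int:
    """Number of neighbors m with P(v_a|m) > P(v_b|m)."""
    return sum(
        1 for m in neighbors
        if cond_prob[m].get(v_a, 0.0) > cond_prob[m].get(v_b, 0.0)
    )


def decide(neighbors: Sequence[str], v1: str, v2: str,
           cond_prob: Dict[str, Dist]) -> str:
    """Outcome for one test triple (n, v1, v2), where v1 is the true verb."""
    e1 = evidence(neighbors, v1, v2, cond_prob)
    e2 = evidence(neighbors, v2, v1, cond_prob)
    if e1 > e2:
        return "correct"
    if e1 < e2:
        return "incorrect"
    return "tie"


def error_rate(outcomes: Sequence[str]) -> float:
    """(number incorrect + number of ties / 2) / T."""
    return (outcomes.count("incorrect") + outcomes.count("tie") / 2) / len(outcomes)


# Toy data (hypothetical): three candidate neighbors with a made-up ranking.
cond_prob = {
    "pear": {"eat": 0.6, "peel": 0.4},
    "plum": {"eat": 0.7, "pick": 0.3},
    "car": {"drive": 1.0},
}
scores = {"pear": 0.9, "plum": 0.8, "car": 0.1}
neighbors = k_most_similar("apple", list(cond_prob), lambda n, m: scores[m], k=2)
outcome = decide(neighbors, "eat", "drive", cond_prob)
```

Because each neighbor casts a single unweighted vote, the rule depends on the similarity function only through the ranking it induces, not through the magnitudes of its values.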

The reason we used a restricted version of the
distance-weighted averaging model was that we
sought to discover fundamental differences in
behavior. Because we have a binary decision
task, E_{f,k}(n, v1) simply counts the number of k
nearest neighbors to n that make the right decision.
If we have two functions f and g such
that E_{f,k}(n, v1) > E_{g,k}(n, v1), then the k most
similar words according to f are on the whole
better predictors than the k most similar words
according to g; hence, f induces an inherently
better similarity ranking for distance-weighted
averaging. The difficulty with using the full
model (Equation (1)) for comparison purposes
is that fundamental differences can be obscured
by issues of weighting. For example, suppose
the probability estimate Σ (2 − L1(q,r)) · r(v)
(suitably normalized) performed poorly. We
would not be able to tell whether the cause
was an inherent deficiency in the L1 norm or
just a poor choice of weight function; perhaps
(2 − L1(q,r))^2 would have yielded better
estimates.

Figure 2 shows how the average error rate
varies with k for the seven similarity metrics
introduced above. As previously mentioned, a
steeper slope indicates a better similarity ranking.

All the curves have a generally upward trend
but always lie far below backoff (51% error
rate). They meet at k = 1000 because S_{f,1000}(n)
is always the set of all nouns. We see that the
functions fall into four groups: (1) the L2 norm;
(2) Kendall's τ; (3) the confusion probability
and the cosine metric; and (4) the L1 norm,
Jensen-Shannon divergence, and Jaccard's coefficient.

We can account for the similar performance
of various metrics by analyzing how they incorporate
information from the intersection of the
supports of q and r. (Recall that we are using
q and r for the conditional verb cooccurrence
probabilities of two nouns n and m.) Consider
the following supports (illustrated in Figure 3):

$$V_q = \{v \in V : q(v) > 0\}$$
$$V_r = \{v \in V : r(v) > 0\}$$
$$V_{qr} = \{v \in V : q(v)\, r(v) > 0\} = V_q \cap V_r$$

We can rewrite the similarity functions from
Section 2 in terms of these sets, making use
of the identities

$$\sum_{v \in V_q \setminus V_{qr}} q(v) + \sum_{v \in V_{qr}} q(v) = \sum_{v \in V_r \setminus V_{qr}} r(v) + \sum_{v \in V_{qr}} r(v) = 1.$$

Table 1 lists these alternative forms in order of performance.



[Plot: Error rates (averages and ranges)]

Figure 2: Similarity metric performance. Error bars denote the range of error rates over the five
test sets. Backoff's average error rate was 51%.



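$L_2(q,r)$
    $= \sqrt{\sum_{v \in V_q} q(v)^2 - 2 \sum_{v \in V_{qr}} q(v)\, r(v) + \sum_{v \in V_r} r(v)^2}$

$\tau(q,r) \cdot 2\binom{|V|}{2}$
    $= 2\, |V_{qr}|\, |V \setminus (V_q \cup V_r)| - 2\, |V_q \setminus V_{qr}|\, |V_r \setminus V_{qr}|$
    $\quad + \sum_{v_1 \in (V_q \triangle V_r)} \sum_{v_2 \in V_{qr}} \mathrm{sign}[(q(v_1) - q(v_2))(r(v_1) - r(v_2))]$
    $\quad + \sum_{v_1 \in V_{qr}} \sum_{v_2 \in V_q \cup V_r} \mathrm{sign}[(q(v_1) - q(v_2))(r(v_1) - r(v_2))]$

$\mathrm{conf}(q, r, P(m))$
    $= P(m) \sum_{v \in V_{qr}} q(v)\, r(v) / P(v)$
$\cos(q,r)$
    $= \sum_{v \in V_{qr}} q(v)\, r(v) \left( \sum_{v \in V_q} q(v)^2 \sum_{v \in V_r} r(v)^2 \right)^{-1/2}$

$L_1(q,r)$
    $= 2 + \sum_{v \in V_{qr}} \left( |q(v) - r(v)| - q(v) - r(v) \right)$
$JS(q,r)$
    $= \log 2 + \frac{1}{2} \sum_{v \in V_{qr}} \left( h(q(v) + r(v)) - h(q(v)) - h(r(v)) \right), \quad h(x) = -x \log x$
$Jac(q,r)$
    $= |V_{qr}| / |V_q \cup V_r|$

Table 1: Similarity functions, written in terms of sums over supports and grouped by average
performance. \ denotes set difference; Δ denotes symmetric set difference.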



We see that for the non-combinatorial functions,
the groups correspond to the degree to which
the measures rely on the verbs in V_qr. The
Jensen-Shannon divergence and the L1 norm
can be computed simply by knowing the values
of q and r on V_qr. For the cosine and the
confusion probability, the distribution values on
V_qr are key, but other information is also incorporated.
The statistic τ takes into account all
verbs, including those that occur neither with
n nor m. Finally, the Euclidean distance is
quadratic in verbs outside V_qr; indeed, Kaufman
and Rousseeuw (1990) note that it is "extremely
sensitive to the effect of one or more outliers"
(pg. 117).

Figure 3: Supports on V

The superior performance of Jac(q, r) seems
to underscore the importance of the set V_qr.
Jaccard's coefficient ignores the values of q and
r on V_qr; but we see that simply knowing the
size of V_qr relative to the supports of q and r
leads to good rankings.
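The claim that the L1 norm and the Jensen-Shannon divergence need only the values of q and r on V_qr can be checked numerically. The Python sketch below computes each function twice, once directly over all verbs and once from the intersection alone, using the fact that each distribution sums to 1 so that the mass outside V_qr is determined by the mass inside it. The function names and the toy distributions are hypothetical.

```python
import math
from typing import Dict, List

Dist = Dict[str, float]  # verb -> conditional probability


def h(x: float) -> float:
    """h(x) = -x log x, with h(0) taken to be 0."""
    return -x * math.log(x) if x > 0 else 0.0


def shared_support(q: Dist, r: Dist) -> List[str]:
    """V_qr: verbs with positive probability under both q and r."""
    return [v for v, p in q.items() if p > 0 and r.get(v, 0.0) > 0]


def l1_direct(q: Dist, r: Dist) -> float:
    return sum(abs(q.get(v, 0.0) - r.get(v, 0.0)) for v in set(q) | set(r))


def l1_from_intersection(q: Dist, r: Dist) -> float:
    # Mass outside V_qr is 1 - (mass inside), for each of q and r.
    return 2 + sum(abs(q[v] - r[v]) - q[v] - r[v] for v in shared_support(q, r))


def js_direct(q: Dist, r: Dist) -> float:
    total = 0.0
    for v in set(q) | set(r):
        q_v, r_v = q.get(v, 0.0), r.get(v, 0.0)
        avg = (q_v + r_v) / 2
        if q_v > 0:
            total += 0.5 * q_v * math.log(q_v / avg)
        if r_v > 0:
            total += 0.5 * r_v * math.log(r_v / avg)
    return total


def js_from_intersection(q: Dist, r: Dist) -> float:
    return math.log(2) + 0.5 * sum(
        h(q[v] + r[v]) - h(q[v]) - h(r[v]) for v in shared_support(q, r)
    )


# Toy distributions (hypothetical) with two shared verbs.
q = {"eat": 0.5, "peel": 0.3, "buy": 0.2}
r = {"eat": 0.4, "buy": 0.1, "drive": 0.5}
```

Both intersection-only versions iterate over V_qr alone, which is typically much smaller than the full verb vocabulary.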

4 The Skew Divergence 

Based on the results just described, it appears
that it is desirable to have a similarity function
that focuses on the verbs that cooccur with
both of the nouns being compared. However,
we can make a further observation: with the
exception of the confusion probability, all the
functions we compared are symmetric, that is,
f(q,r) = f(r,q). But the substitutability of
one word for another need not be symmetric. For
instance, "fruit" may be the best possible approximation
to "apple", but the distribution of
"apple" may not be a suitable proxy for the distribution
of "fruit".^4

In accordance with this insight, we developed
a novel asymmetric generalization of the KL divergence,
the α-skew divergence:

$$s_\alpha(q,r) = D(r \,\|\, \alpha \cdot q + (1 - \alpha) \cdot r)$$

for 0 ≤ α ≤ 1. It can easily be shown that s_α
depends only on the verbs in V_qr. Note that at
α = 1, the skew divergence is exactly the KL divergence,
and s_{1/2} is twice one of the summands
of JS (note that it is still asymmetric).

We can think of α as a degree of confidence
in the empirical distribution q; or, equivalently,
(1 − α) can be thought of as controlling the
amount by which one smooths q by r. Thus,
we can view the skew divergence as an approximation
to the KL divergence to be used when
sparse data problems would cause the latter
measure to be undefined.
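A minimal Python sketch of the skew divergence makes the smoothing interpretation concrete. The dictionary representation, the use of math.inf for the undefined case at α = 1, and the toy distributions are assumptions of this example.

```python
import math
from typing import Dict

Dist = Dict[str, float]  # verb -> conditional probability


def skew_divergence(q: Dist, r: Dist, alpha: float = 0.99) -> float:
    """KL divergence from r to the mixture alpha*q + (1 - alpha)*r."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    total = 0.0
    for v, r_v in r.items():
        if r_v == 0.0:
            continue  # zero-probability verbs contribute nothing
        mix = alpha * q.get(v, 0.0) + (1.0 - alpha) * r_v
        if mix == 0.0:
            return math.inf  # happens only at alpha = 1 with q(v) = 0
        total += r_v * math.log(r_v / mix)
    return total


# Toy distributions (hypothetical): "drive" occurs with r's noun but not q's.
q = {"eat": 0.5, "peel": 0.5}
r = {"eat": 0.5, "drive": 0.5}
at_one = skew_divergence(q, r, alpha=1.0)     # plain KL: undefined case
at_099 = skew_divergence(q, r, alpha=0.99)    # finite thanks to the mixture
at_zero = skew_divergence(q, r, alpha=0.0)    # mixture equals r
```

For any α strictly below 1, the mixture is positive wherever r is, so the value stays finite even when the supports of q and r differ.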

Figure 4 shows the performance of s_α for
α = .99. It performs better than all the other
functions; the difference with respect to Jaccard's
coefficient is statistically significant, according
to the paired t-test, at all k (except
k = 1000), with significance level .01 at all k
except 100, 400, and 1000.

5 Discussion 

In this paper, we empirically evaluated a number
of distributional similarity measures, including
the skew divergence, and analyzed their
information sources. We observed that the ability
of a similarity function f(q,r) to select useful
nearest neighbors appears to be correlated with
its focus on the intersection V_qr of the supports
of q and r. This is of interest from a computational
point of view because V_qr tends to be a
relatively small subset of V, the set of all verbs.
Furthermore, it suggests downplaying the role of
negative information, which is encoded by verbs
appearing with exactly one noun, although the
Jaccard coefficient does take this type of information
into account.

Our explicit division of V-space into various
support regions has been implicitly considered
in other work. Smadja et al. (1996)
observe that for two potential mutual translations
X and Y, the fact that X occurs with
translation Y indicates association; X's occurring
with a translation other than Y decreases
one's belief in their association; but the absence
of both X and Y yields no information. In
essence, Smadja et al. argue that information
from the union of supports, rather than just
the intersection, is important. D. Lin (1997;
1998a) takes an axiomatic approach to determining
the characteristics of a good similarity
measure. Starting with a formalization (based
on certain assumptions) of the intuition that the
similarity between two events depends on both
their commonality and their differences, he derives
a unique similarity function schema. The
definition of commonality is left to the user (several
different definitions are proposed for different
tasks).

^4 On a related note, an anonymous reviewer cited the
following example from the psychology literature: we can
say Smith's lecture is like a sleeping pill, but "not the
other way round".

Figure 4: Performance of the skew divergence with respect to the best functions from Figure 2

We view the empirical approach taken in this 
paper as complementary to Lin's. That is, we 
are working in the context of a particular appli- 
cation, and, while we have no mathematical cer- 
tainty of the importance of the "common sup- 
port" information, we did not assume it a priori; 
rather, we let the performance data guide our 
thinking. 

Finally, we observe that the skew metric 
seems quite promising. We conjecture that ap- 
propriate values for a may inversely correspond 
to the degree of sparseness in the data, and 
intend in the future to test this conjecture on 
larger-scale prediction tasks. We also plan to 
evaluate skewed versions of the Jensen-Shannon 
divergence proposed by Rao (1982) and J. Lin